About the project

First day

Starting a bit late, but very hopeful! I am really interested in the topic, but technical difficulties are driving me insane. Also very lost for now.


Chapter 2. Regression and model validation

Data exploration

Let’s start by reading the dataset that we previously wrangled.

learning2014 <- read.csv("./data/learning2014.csv")

To see the data structure:

dim(learning2014)
## [1] 166   8
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   X gender age attitude     deep  stra     surf points
## 1 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6 6      F  38      3.8 4.750000 3.625 2.416667     21

These data result from a survey conducted during an Introduction to Social Statistics course in the fall of 2014. This particular data frame contains responses from the 166 participants who took the exam. There are seven entries for each participant: gender, age, attitude towards learning, scores on questions related to deep, surface and strategic learning, and the exam grade.

summary(learning2014$gender)
##   F   M 
## 110  56

110 women and 56 men took the survey.

summary(learning2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00

Their ages ranged from 17 to 55 years, with a mean of 25.51.

summary(learning2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   2.600   3.200   3.143   3.700   5.000
summary(learning2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(learning2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(learning2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333

Attitude, as well as the questions pertaining to deep, surface and strategic learning, was measured on a scale from 1 to 5.

summary(learning2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

The exam grades varied from 7 to 33 points, with a mean of 22.72. The next step is graphical exploration.

library(GGally)
## Loading required package: ggplot2
library(ggplot2)
pairs(learning2014[-1], col = learning2014$gender)

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The plot shows pairwise correlations between all the variables except gender: as a binary variable, it should not simply be correlated with continuous ones. Instead, gender is used to colour all of the paired plots for a more detailed visual analysis. We can also inspect the boxplots for each variable. None of the variables contains an alarming number of outliers, except for age.

Probability plots

Before starting any analysis, it is good to look at the probability plots of each variable.

Options:
1. The qqnorm() function creates a quantile-quantile plot evaluating the fit of sample data to the normal distribution.
2. The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate distributions. The format is fitdistr(x, densfun), where x is the sample data and densfun is one of: “beta”, “cauchy”, “chi-squared”, “exponential”, “f”, “gamma”, “geometric”, “log-normal”, “lognormal”, “logistic”, “negative binomial”, “normal”, “Poisson”, “t” or “weibull”.
3. For others, see: https://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf

If the distribution does not satisfy the normality criterion, log-transform the variable to identify outliers more clearly.
Delete those observations.
Try step 1 again.
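As a minimal sketch of option 1 (assuming learning2014 is loaded as above), applied here to the ‘points’ variable:

```r
# Q-Q plot of exam points against theoretical normal quantiles
qqnorm(learning2014$points)
qqline(learning2014$points)  # reference line through the quartiles

# maximum-likelihood fit of a normal distribution (MASS package)
library(MASS)
fitdistr(learning2014$points, "normal")
```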

Boxplots for numerical variables

library(ggplot2)
  1. Create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes:

    absences_norm <- rnorm(200,mean=mean(absences, na.rm=TRUE), sd=sd(absences, na.rm=TRUE))
  2. Plot the two variables side by side

    boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)

Correlation plots

library(tidyverse)
library(MASS)
library(corrplot)

Option 1

  1. Calculate the correlation matrix and round it (tidyverse and corrplot packages)

    cor_matrix <- cor(Boston) 
    cor_matrix_r <- cor_matrix %>% round(2)
    # round to 2 decimal places
  2. Visualize the correlation matrix

    corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Option 2

  1. Use the pairs() command (or ggpairs() from the GGally package)
  2. With pairs() you can reduce the number of pairs to see the plots more clearly. After the data argument, specify the columns you want to see. Ex.: pairs(Boston[6:10], col = km$cluster)

Center and standardize numerical variables

scale()
boston_scaled <- scale(Boston) # centers around zero: mean = 0
class(boston_scaled) # scaling returns a matrix
boston_scaled <- as.data.frame(boston_scaled) # convert the matrix to a data frame

Create a categorical variable from a numerical one

Quantiles
  1. Create a quantile vector of crim and print it

    bins <- quantile(boston_scaled$crim)
    bins
  2. Create a categorical variable ‘crime’

    crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
  3. Look at the table of the new factor crime

    table(crime)
  4. Remove the original crim from the dataset

    boston_scaled <- dplyr::select(boston_scaled, -crim)
  5. Add the new categorical variable to the scaled data

    boston_scaled <- data.frame(boston_scaled, crime)

Linear regression

We are now interested in whether the student exam grade can be explained by the other available variables. A multiple regression model is fitted with the response variable ‘points’ and the explanatory variables ‘attitude’, ‘stra’ and ‘surf’.

lm <- lm(points ~ attitude + stra + surf, data = learning2014)
lm
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         stra         surf  
##     11.0171       3.3952       0.8531      -0.5861
summary(lm)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model with three explanatory variables accounts for 21% of the variance (multiple R-squared). The only variable with a statistically significant relationship with the exam grade is ‘attitude’. Next, we remove the other two factors one by one in search of a more parsimonious model.

lm1 <- lm(points ~ attitude + stra, data = learning2014)
lm1
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         stra  
##      8.9729       3.4658       0.9137
summary(lm1)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
lm2 <- lm(points ~ attitude + surf, data = learning2014)
lm2
## 
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude         surf  
##      14.120        3.426       -0.779
summary(lm2)
## 
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.277  -3.236   0.386   3.977  10.642 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.1196     3.1271   4.515 1.21e-05 ***
## attitude      3.4264     0.5764   5.944 1.63e-08 ***
## surf         -0.7790     0.7956  -0.979    0.329    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1854 
## F-statistic: 19.78 on 2 and 163 DF,  p-value: 2.041e-08

Removing ‘surf’ or ‘stra’ barely changed the multiple R-squared, and thus the fit of the model. It is therefore safe to conclude that they do not affect the exam grade, and we can proceed with the only significant factor, ‘attitude’.
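The informal comparison above can also be backed by a formal F-test: anova() on nested models tests whether ‘stra’ and ‘surf’ add significant explanatory power (a sketch, assuming learning2014 is loaded as above):

```r
# compare the single-predictor model to the full three-predictor model;
# a large p-value means the extra terms do not significantly improve the fit
lm_simple <- lm(points ~ attitude, data = learning2014)
lm_full   <- lm(points ~ attitude + stra + surf, data = learning2014)
anova(lm_simple, lm_full)
```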

lm3 <- lm(points ~ attitude, data = learning2014)
lm3
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude  
##      11.637        3.525
summary(lm3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The fitted model explains 19% of the variance; attitude is thus a statistically significant predictor of the exam grade.

Residual analysis

Next, to make sure that the model is correctly specified, residual analysis is in order.

Residuals vs. Fitted values

plot(lm3, which = c(1))

QQ plot (theoretical quantiles)

plot(lm3, which = c(2))

Residuals vs. Leverage

plot(lm3, which = c(5))

The residuals vs. fitted values are evenly distributed and do not demonstrate any pattern. The QQ plot shows a good fit along the line, with no cause for concern. The leverage plot does not show any particularly influential outliers.
The model is valid.
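The three diagnostic plots can also be drawn side by side in one call (a minimal sketch, assuming lm3 is fitted as above):

```r
# Residuals vs Fitted (1), Normal Q-Q (2) and Residuals vs Leverage (5)
par(mfrow = c(1, 3))
plot(lm3, which = c(1, 2, 5))
par(mfrow = c(1, 1))  # reset the plotting layout
```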

Prediction. Train and test sets

  1. Determine the number of rows in the dataset

    n <- nrow(boston_scaled)
  2. Choose randomly 80% of the rows

    ind <- sample(n,  size = n * 0.8)
  3. Create train set

    train <- boston_scaled[ind,]
  4. Create the test set by subtracting the train set

    test <- boston_scaled[-ind,]
  5. Save the correct classes from test data

    correct_classes <- test$crime # or whichever variable is being predicted
  6. Remove this variable from test data

    test <- dplyr::select(test, -crime)


Chapter 3. Logistic regression. Analysis

These data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features; the data were collected using school reports and questionnaires.

alc <- read.csv("./data/alc.csv")

The dataset contains 382 observations of 35 variables covering the performance, family, personal life and other characteristics of students, as well as their alcohol consumption.

Here, I am interested in studying the relationships between high/low alcohol consumption and other variables in the data, namely, four of them: ‘absences’ (number of school absences), ‘traveltime’ (home to school travel time), ‘goout’ (going out with friends), and ‘famrel’ (quality of family relationships).

I hypothesize that a high number of absences and a lot of time spent with friends, along with bad family relationships and a short travel time, will increase the odds of high alcohol consumption in students.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
keep_columns <- c("sex", "age", "absences", "traveltime", "goout", "famrel", "high_use")
alc <- select(alc, one_of(keep_columns))
sex <- alc$sex
age <- alc$age
absences <- alc$absences
traveltime <- alc$traveltime
goout <- alc$goout
famrel <- alc$famrel
high_use <- alc$high_use

We have now reformatted the dataset to only keep the columns pertaining to the hypothesis and two general variables, sex and age. Now onto the graphical exploration.

library(ggplot2)
# create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes
absences_norm <- rnorm(200,mean=mean(absences, na.rm=TRUE), sd=sd(absences, na.rm=TRUE))
# plot the two variables side by side
boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)

First, we analyze the only truly numeric variable, the number of absences. As seen in the boxplot, the actual distribution is within normal limits (with no negative values) and has a few outliers. Before excluding those observations, we will explore the other variables and try to fit the model.
Next, we plot the interval variables, which use an answer scale from 1 to 5.

boxplot(traveltime, goout, famrel, main = "Boxplots of interval variables", at = c(1,2,3), names = c("travel time","time with friends", "family relationships"), las = 2, col = c("blue","violet","green"))

The “goout” variable seems to be distributed normally, although the two others are skewed in different directions. Next, we will plot each variable against the response variable and explore the dataset further.

library(GGally)
cross <- ggpairs(alc, mapping = aes(col = sex), lower = list(combo = wrap("facethist", bins = 20)))
cross

The explanatory variables have very low pairwise correlations, so there is little risk of multicollinearity interfering with the model. Now onto the model fitting.
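Those pairwise correlations can also be checked numerically (a sketch, assuming the subset alc from above):

```r
# correlation matrix of the four explanatory variables, rounded for readability
round(cor(alc[, c("absences", "traveltime", "goout", "famrel")]), 2)
```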

m <- glm(high_use ~ absences + traveltime + goout + famrel, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + traveltime + goout + famrel, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0083  -0.7805  -0.5437   0.8464   2.3984  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.97770    0.68176  -4.368 1.26e-05 ***
## absences     0.07559    0.02202   3.432 0.000598 ***
## traveltime   0.43957    0.17532   2.507 0.012167 *  
## goout        0.76735    0.12087   6.348 2.18e-10 ***
## famrel      -0.36294    0.13756  -2.638 0.008331 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 389.46  on 377  degrees of freedom
## AIC: 399.46
## 
## Number of Fisher Scoring iterations: 4

All of the explanatory variables have a statistically significant relationship with the target variable.
Next, we will calculate the odds ratios and confidence intervals to validate our model.

odds_ratios <- coef(m) %>% exp
conf_int <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(odds_ratios, conf_int)
##             odds_ratios      2.5 %    97.5 %
## (Intercept)   0.0509099 0.01285711 0.1879144
## absences      1.0785172 1.03437018 1.1291483
## traveltime    1.5520466 1.10198801 2.1968917
## goout         2.1540600 1.71012923 2.7497327
## famrel        0.6956280 0.52984131 0.9103098

The confidence intervals show that the odds of high alcohol consumption are:
1. 1.03 to 1.13 times higher per additional absence;
2. 1.10 to 2.20 times higher per unit increase in home-to-school travel time;
3. 1.71 to 2.75 times higher per unit increase in time spent going out with friends;
4. 0.53 to 0.91 times as high (i.e. lower) per unit improvement in family relationships.

To continue fitting the model, we will compare predicted and actual values.

# predict the probability of high_use
probabilities <- predict(m, type = "response")

alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
select(alc, absences, traveltime, goout, famrel, high_use, probability, prediction) %>% head(10)
##    absences traveltime goout famrel high_use probability prediction
## 1         5          2     4      4    FALSE  0.47428353      FALSE
## 2         3          1     3      5    FALSE  0.13895457      FALSE
## 3         8          1     2      4     TRUE  0.13581670      FALSE
## 4         1          1     2      3    FALSE  0.11746601      FALSE
## 5         2          1     2      4    FALSE  0.09079210      FALSE
## 6         8          1     2      5    FALSE  0.09855191      FALSE
## 7         0          1     4      4    FALSE  0.28486278      FALSE
## 8         4          2     4      4    FALSE  0.45548223      FALSE
## 9         0          1     2      4    FALSE  0.07906088      FALSE
## 10        0          1     1      5    FALSE  0.02697575      FALSE
prediction_plot <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
prediction_plot + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64921466 0.05235602 0.70157068
##    TRUE  0.18586387 0.11256545 0.29842932
##    Sum   0.83507853 0.16492147 1.00000000

Not totally amazing, but still. The penultimate step is to compute the model's training error.
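The shares on the diagonal of the proportion table above give the overall accuracy; a minimal sketch (assuming alc with the prediction column as above):

```r
# overall accuracy = sum of the diagonal of the proportional confusion matrix
conf <- prop.table(table(high_use = alc$high_use, prediction = alc$prediction))
accuracy <- sum(diag(conf))
accuracy  # the complement of the mean prediction error
```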

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the training data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199

Quite a good result!
Finally, we perform the 10-fold cross-validation.

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2513089

Sliiiiightly better than the 0.26 achieved by the DataCamp model.


Chapter 4. Clustering and classification

Data wrangling

The first item on today’s agenda is to load the Boston dataset from the MASS package in R.

# loading
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

# initial dataset exploration
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

This dataset contains 506 observations of 14 variables, two of them stored as integers (‘chas’ and ‘rad’) and the rest numeric. The indices were measured across the following variables:
1. ‘crim’ (per capita crime rate by town)
2. ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft)
3. ‘indus’ (proportion of non-retail business acres per town)
4. ‘chas’ (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
5. ‘nox’ (nitrogen oxides concentration (parts per 10 million))
6. ‘rm’ (average number of rooms per dwelling)
7. ‘age’ (proportion of owner-occupied units built prior to 1940)
8. ‘dis’ (weighted mean of distances to five Boston employment centres)
9. ‘rad’ (index of accessibility to radial highways)
10. ‘tax’ (full-value property-tax rate per $10,000)
11. ‘ptratio’ (pupil-teacher ratio by town)
12. ‘black’ (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)
13. ‘lstat’ (lower status of the population (percent))
14. ‘medv’ (median value of owner-occupied homes in $1000s)

The variables have very different ranges, which calls for standardisation before the analysis. Now onto a graphical overview.

pairs(Boston)

Many variables seem to be strongly correlated. The plot of ‘nox’ against ‘dis’, for instance, suggests a 1/x relationship. This makes sense, since air pollution dissipates the farther you are from the city center.

To check whether this is just an observation or a valid hypothesis, let’s check for correlations.

# calculate the correlation matrix
cor_matrix <- cor(Boston)
# round to 2 decimal places
cor_matrix_r <- round(cor_matrix, 2)
cor_matrix_r
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The hypothesis holds: the correlation coefficient between ‘nox’ and ‘dis’ is -0.77.

Data preparation

For classification and clustering, the next step is to scale the dataset so that variables with large ranges do not skew the results.
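Under the hood, scale() subtracts each column's mean and divides by its standard deviation; a quick check on the ‘crim’ column (a sketch, assuming the MASS package is available):

```r
library(MASS)  # provides the Boston dataset

# scale() standardizes column-wise: (x - mean(x)) / sd(x)
z <- as.vector(scale(Boston$crim))
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(z, manual)  # TRUE
```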

# scaling around 0
boston_scaled <- scale(Boston)

# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Now the variables are on the same scale, with a mean of 0. For classification we need a binary or multiclass target variable. Since we are most interested in classifying neighbourhoods according to crime rate, we next transform the ‘crim’ variable into a categorical one via quantiles.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

We are done with data preparation and now can go on separating the test and training sets for further model training and validation.

# determine the number of rows in the dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create a train set
train <- boston_scaled[ind,]

# create a test set by subtracting the train set
test <- boston_scaled[-ind,]

All the steps in preparing the data are complete.

Linear Discriminant Analysis

Fitting the Linear Discriminant Analysis to the train data:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2500000 0.2500000 0.2425743 
## 
## Group means:
##                  zn      indus          chas        nox         rm
## low       0.9695518 -0.8986348 -0.0830454025 -0.8756404  0.4878057
## med_low  -0.1348835 -0.3134450  0.0005392655 -0.5531867 -0.1342093
## med_high -0.3751653  0.2211521  0.2734075986  0.4479629  0.1463542
## high     -0.4872402  1.0171960 -0.0312821145  1.0898970 -0.4303178
##                 age        dis        rad        tax      ptratio
## low      -0.8820174  0.7749371 -0.6815029 -0.7621389 -0.525308537
## med_low  -0.2765583  0.2778917 -0.5565973 -0.4831296 -0.006900653
## med_high  0.4503409 -0.4079728 -0.3769361 -0.2766939 -0.343039722
## high      0.8216375 -0.8567975  1.6373367  1.5134896  0.779855168
##                black       lstat        medv
## low       0.38267389 -0.78822123  0.58650165
## med_low   0.31265819 -0.14589919  0.01721007
## med_high  0.08389883  0.01674961  0.18385744
## high     -0.66159869  0.87963513 -0.67824701
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.08288436  0.712371236 -0.873055592
## indus    0.01371924 -0.253069409  0.026522722
## chas    -0.07247237 -0.067707365  0.087303367
## nox      0.42893505 -0.738898977 -1.219089496
## rm      -0.07377746 -0.139768243 -0.268140533
## age      0.29610796 -0.383542312 -0.048968773
## dis     -0.02538277 -0.386201738  0.151613840
## rad      2.94034144  1.063946739 -0.349412621
## tax      0.05341197 -0.158988334  0.780717136
## ptratio  0.10550874  0.002075486 -0.017374248
## black   -0.11256505 -0.006756739  0.026240739
## lstat    0.20540496 -0.206044671  0.367271246
## medv     0.19818226 -0.372006991 -0.003011808
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9411 0.0440 0.0149

The group means show that the three variables most strongly associated with high crime rates are the index of accessibility to radial highways (1.64), the full-value property-tax rate per $10,000 (1.51), and, interestingly, nitrogen oxides concentration (1.09).

The linear discriminant coefficients confirm this for accessibility to radial highways: its LD1 coefficient (2.94) is roughly three times larger than that of any other variable in the dataset.
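To make that comparison explicit, we can rank the variables by the absolute size of their weight on the first discriminant (a small sketch assuming the `lda.fit` object above; `lda()` stores the coefficients in its `scaling` matrix):

```r
# rank variables by absolute LD1 coefficient, largest first
sort(abs(lda.fit$scaling[, "LD1"]), decreasing = TRUE)
```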

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)

The biplot confirms that accessibility to radial highways has by far the strongest influence: its arrow dominates the first linear discriminant.

Prediction

Next, we will test the fitted model on the test set that we previously set aside.

# save the correct classes from test data
correct_classes <- test$crime

# remove this variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      11        0    0
##   med_low    7      15        3    0
##   med_high   0      14       11    0
##   high       0       0        0   29

The model performs reasonably well: the high category is classified perfectly, but the low and middle categories overlap considerably, with roughly a third of the test observations misclassified (35 out of 102).
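The overall accuracy can be read off the diagonal of the cross-tabulation (a sketch assuming the `correct_classes` and `lda.pred` objects above; the factor levels of the two vectors share the same order, so the diagonal holds the correct predictions):

```r
# proportion of correctly classified test observations
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```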

K-means clustering

Next, having assigned observations to previously determined classes, we will turn to unsupervised methods. K-means clustering begins, as usual, with loading and scaling the data.

# loading
library(MASS)
data("Boston")

# standardise: centre each variable and scale to unit variance
boston_scaled <- scale(Boston)

# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

After standardisation is complete, we calculate the distances between the scaled observations.

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
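`dist()` defaults to the Euclidean metric, but other metrics are available; for comparison, a Manhattan distance matrix is computed the same way:

```r
# Manhattan (city-block) distance matrix for comparison
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```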

Next, we will try out a random number of clusters.

# k-means clustering
km <- kmeans(boston_scaled, centers = 10)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

The plot looks very colourful, but it is obvious that the number of clusters is too big. Let's cut it in half.

# k-means clustering
km <- kmeans(boston_scaled, centers = 5)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Better, but still confusing. It is time to use a more principled method for choosing the number of clusters for this dataset.

set.seed(123)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
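`qplot()` has been deprecated in recent versions of ggplot2; the same elbow plot can be drawn with `ggplot()` directly (the data frame column names here are arbitrary):

```r
# equivalent elbow plot built with ggplot()
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()
```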

According to the total within-cluster sum of squares plot, the most pronounced "elbow" occurs at two clusters, where the curve drops most sharply. Therefore, we will run the k-means analysis with 2 centroids.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

It turns out that the most fitting number of clusters for this dataset is just 2. A bit underwhelming, though.
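To see what the two clusters actually capture, one informative check (a sketch assuming the final `km` object above) is to compare the cluster means of the scaled crime rate:

```r
# mean scaled crime rate within each of the two clusters
tapply(boston_scaled$crim, km$cluster, mean)
```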